Chinese Word Segmentation as LMR Tagging
نویسندگان
چکیده
In this paper we present Chinese word segmentation algorithms based on the socalled LMR tagging. Our LMR taggers are implemented with the Maximum Entropy Markov Model and we then use Transformation-Based Learning to combine the results of the two LMR taggers that scan the input in opposite directions. Our system achieves F-scores of and on the Academia Sinica corpus and the Hong Kong City University corpus respectively. 1 Segmentation as Tagging Unlike English text in which sentences are sequences of words delimited by white spaces, in Chinese text, sentences are represented as strings of Chinese characters or hanzi without similar natural delimiters. Therefore, the first step in a Chinese language processing task is to identify the sequence of words in a sentence and mark boundaries in appropriate places. This may sound simple enough but in reality identifying words in Chinese is a non-trivial problem that has drawn a large body of research in the Chinese language processing community (Fan and Tsai, 1988; Gan et al., 1996; Sproat et al., 1996; Wu, 2003; Xue, 2003). The key to accurate automatic word identification in Chinese lies in the successful resolution of ambiguities and a proper way to handle out-of-vocabulary words. The ambiguities in Chinese word segmentation is due to the fact that a hanzi can occur in different word-internal positions (Xue, 2003). Given the proper context, generally provided by the sentence in which it occurs, the position of a hanzi can be determined. In this paper, we model the Chinese word segmentation as a hanzi tagging problem and use a machine-learning algorithm to determine the appropriate position for a hanzi. There are several reasons why we may expect this approach to work. First, Chinese words generally have fewer than four characters. As a result, the number of positions is small. Second, although each hanzi can in principle occur in all possible positions, not all hanzi behave this way. A substantial number of hanzi are distributed in a constrained manner. For example, , the plural marker, almost always occurs in the word-final position. Finally, although Chinese words cannot be exhaustively listed and new words are bound to occur in naturally occurring text, the same is not true for hanzi. The number of hanzi stays fairly constant and we do not generally expect to see new hanzi. We represent the positions of a hanzi with four different tags (Table 1): LM for a hanzi that occurs on the left periphery of a word, followed by other hanzi, MM for a hanzi that occurs in the middle of a word, MR for a hanzi that occurs on the right periphery of word, preceded by other hanzi, and LR for hanzi that is a word by itself. We call this LMR tagging. With this approach, word segmentation is a process where each hanzi is assigned an LMR tag and sequences of hanzi are then converted into sequences of words based on the LMR tags. The use of four tags is linguistically intuitive in that LM tags morphemes that are prefixes or stems in the absence of prefixes, MR tags morphemes that are suffixes or stems in the absence of suffixes, MM tags stems with affixes and LR tags stems without affixes. Representing the distributions of hanzi with LMR tags also makes it easy to use machine learning algorithms which has been successfully applied to other tagging problems, such as POS-tagging and IOB tagging used in text chunking. Language Processing, July 2003, pp. 176-179. Proceedings of the Second SIGHAN Workshop on Chinese
منابع مشابه
Two-Phase LMR-RC Tagging for Chinese Word Segmentation
In this paper we present a Two-Phase LMR-RC Tagging scheme to perform Chinese word segmentation. In the Regular Tagging phase, Chinese sentences are processed similar to the original LMR Tagging. Tagged sentences are then passed to the Correctional Tagging phase, in which the sentences are re-tagged using extra information from the first round tagging results. Two training methods, Separated Mo...
متن کاملComparison of the Impact of Word Segmentation on Name Tagging for Chinese and Japanese
Word Segmentation is usually considered an essential step for many Chinese and Japanese Natural Language Processing tasks, such as name tagging. This paper presents several new observations and analysis on the impact of word segmentation on name tagging; (1). Due to the limitation of current state-of-the-art Chinese word segmentation performance, a character-based name tagger can outperform its...
متن کاملCombining Character-Based and Subsequence-Based Tagging for Chinese Word Segmentation
Chinese word segmentation is the initial step for Chinese information processing. The performance of Chinese word segmentation has been greatly improved by character-based approaches in recent years. This approach treats Chinese word segmentation as a character-wordposition-tagging problem. With the help of powerful sequence tagging model, character-based method quickly rose as a mainstream tec...
متن کاملChinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based?
Chinese part-of-speech (POS) tagging assigns one POS tag to each word in a Chinese sentence. However, since words are not demarcated in a Chinese sentence, Chinese POS tagging requires word segmentation as a prerequisite. We could perform Chinese POS tagging strictly after word segmentation (one-at-a-time approach), or perform both word segmentation and POS tagging in a combined, single step si...
متن کاملWord Boundary Decision with CRF for Chinese Word Segmentation
Chinese word segmentation systems necessarily perform both accurately and quickly for real applications. In this paper, we study on word boundary decision (WBD) approach for Chinese word segmentation and implement it as a 2-tag character tagging with conditional random filed (CRF). With a help of tag transition features, WBD with CRF segmentation approach can achieve comparative performances co...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003